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Abstract. Multi-way data arises in many applications such as EEG classification, face recognition, text mining and 
hyperspectral data analysis. Tensor decomposition has been commonly used to find the hidden factors and elicit the intrinsic 
structures of the multi-way data. This paper considers sparse nonnegative Tucker decomposition (NTD), which is to decompose 
a given tensor into the product of a core tensor and some factor matrices with sparsity and nonnegativity constraints. An 
alternating proximal gradient method (APG) is applied to solve the problem by two different orders of updating the core tensor 
and factor matrices. The algorithm is then modified to sparse NTD with missing values. Per-iteration cost of the algorithm 
is estimated scalable about the data size, and global convergence is established under fairly loose conditions. Numerical 
experiments on both synthetic and real world data demonstrate its superiority over some state-of-the-art methods for (sparse) 
NTD from partial and/or full observations. The Matlab code along with some demos are accessible from the author's homepage. 
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1. Introduction. A tensor is a multi-dimensional array. For example, a vector is a first-order tensor, 
and a matrix is a second-order tensor. The order of a tensor is the number of dimensions, also called way 
or mode. Tensors naturally arise in the applications that collect data along multiple dimensions, including 
space, time, and spectrum, from different subjects (e.g., patients), under varying conditions, and in different 
modalities. Examples include medical data (CT, MRI, EEG), text data and hyperspectral images. An 
efhcient approach to elicit the intrinsic structure of multi-dimensional data is tensor decomposition. Two 
commonly used tensor decompositions are CANDECOMP/PARAFAC decomposition (CPD) [6, 11] and 
Tucker decomposition (TD) [34], also called higher-order singular value decomposition (HOSVD) [7]. CPD 
decomposes an iVth-order tensor into the product of N factor matrices Ai, • • • , Aat, and TD decomposes 

into the product of a core tensor C and N factor matrices Ai, • • • , A^r. CPD can be regarded as a special 
case of TD by restricting C to a super-identity X, which is an extension of the identity matrix and has all 
one's on its superdiagonal and all zero's off the superdiagonal. 

This paper focuses on sparse nonnegative Tucker decomposition (NTD) [18], which uses nonnegativity 
constraints and some sparsity regularizers on the core tensor and/or factor matrices. Nonnegativity allows 
only additivity, so the solutions are often intuitive to understand and explain. Promoting the sparsity of 
the core tensor aims at improving the interpretability of the solutions. Roughly speaking, the core tensor 
interacts with all the factor matrices, and a simple one is often preferred [13]. Consider a three-order tensor, 
for example. The (1, 1, l)-th component of the core tensor couples the first columns of three factor matrices 
together. If it is not zero, then the three columns interacts with each other. Otherwise, they have no or only 
weak relations. Forcing the core tensor to be sparse can often keep strong interactions between the factor 
matrices and remove the weak ones. Sparse factor matrices make the decomposed parts more meaningful and 
can enhance uniqueness as explained in [26] . Sparse NTD has found a large number of applications such as 
in computer vision [32] , EEG classification [22] , hyperspectral data analysis [40] , social network analysis [28] , 
text mining [26], face recoginition [39], and so on. 

NTD includes nonnegative CP decomposition (NCP) [35] as a special case by fixing the core tensor 
to a super-identity. Use of a core tensor makes NTD more flexible and suitable for applications with very 
unbalanced data, while NCP may not work well for this kind of data. For instance, when the given tensor is 
5 X 5 X 1000, using a 3 x 3 x 3 may lead to large fitting error, but a 100 x 100 x 100 core tensor is obviously 
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redundant in the first and second modes. However, the core tensor makes NTD more difficult than NCP 
to solve since the core tensor relates to all factor matrices and the update of each factor matrix requires 
tensor-matrix multiplications. Also, for many algorithms such as alternating least squares, the update of the 
core tensor is extremely expensive and makes these algorithms not suitable for applications with large-scale 
data. 

1.1. Related work. NTD is a highly non-convex problem, and sparse regularizers make the problem 
even harder. A natural and often efficient way to solve the problem is to alternatively update the core 
tensor and factor matrices. It includes, but not limited to, alternating least squares method (ALS) [10], 
column-wise coordinate descent (CCD) [24], multiplicative update (Mult) [26], and hierarchical alternating 
least squares (HALS) [29]. ALS alternatively updates the core tensor and factor matrices by solving a 
sequence of nonnegative least squares problems, which needs to do matrix inverse and make ALS unsuitable 
for large-scale problems. For this reason, [10] simply restricts the core tensor to be super-diagonal in its 
numerical tests. CCD has closed form update for each column of a factor matrix. However, it still needs to 
solve a huge nonnegative least squares problem to update the core tensor, which makes CCD unsuitable for 
large-scale problems either. Mult is an extension of the multiplicative update method in [21] for nonnegative 
matrix factorization [20,27] and has a relatively low per-iteration cost. At each iteration, it only needs some 
tensor-matrix multiplications and component-wise divisions. The drawback of Mult is its slow convergence, 
which makes the algorithm often run a large number of iterations to reach an acceptable data fitting. 
Like ALS, HALS needs to solve a sequence of nonnegative least squares problems, but it updates factor 
matrices in a column-wise way and the core tensor component-wisely, which enables closed form solutions 
for all subproblems. In addition, HALS often converges faster than Mult. However, as shown in [30], the 
convergence speed of HALS is still not satisfying. 

There are also algorithms such as the damped Gauss-Newton method (dGN) in [30] that updates the 
core tensor and factor matrices simultaneously. It is demonstrated that dGN overwhelmingly outperforms 
Mult and HALS in terms of convergence speed. The bottleneck of this method is the computation of Hessian 
inverse or how to solve a huge linear system, although the authors carefully explained a fast computation of 
an approximate Hessian and gradient at each iteration. 

Recently, [37] proposed an alternating proximal gradient method (APG) for solving NCP, and it was 
observed superior to some other algorithms such as the alternating direction method of multiplier (ADMM) 
[41] and alternating nonnegative least squares method (ANLS) [14, 16] in both speed and solution quality. 
Unlike ANLS that exactly solves each subproblem, APG updates every factor matrix by solving a relaxed 
subproblem with a separable quadratic objective. Each relaxed subproblem has a closed form solution, which 
makes low per-iteration cost. Using an extrapolation technique, APG also converges very fast. Moreover, 
the method is proved to globally converge to a stationary point under some boundedness assumptions on its 
iterates. This is an important theoretical result that many other methods such as ADMM and ANLS have 
not been proved to own. 

1.2. Contributions. We apply the APG method proposed in [37] to sparse NTD. First, we update the 
core tensor C and factor matrices Ai, • • • , A^v cyclically, namely, in the order of C, Ai, • • • , Aat. Then we im- 
prove the algorithm by updating the core tensor and factor matrices in the order of C, Ai, C, A2, • • • , C, A^r, 
which can often largely speed up the convergence of the algorithm. Global convergence is established for 
both the original and the improved algorithms. We also analyze the per-iteration complexity of the algo- 
rithms, and it is estimated scalable about the data size if the core size is small. In addition, we modify 
the algorithms to sparse NTD with missing values and consider some extensions of NTD including sparse 
higher-order principal component analysis [1]. The algorithms are carefully implemented in Matlab and 
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compared to some state-of-the-art methods for solving (sparse) NTD from partial and/or full observations 
on both synthetic and real world data. Numerical results show that the proposed algorithms make superior 
performance over other compared ones. 

1.3. Notation. We use small letters a, a;, • • • to denote scalars, bold small letters a, x, y, • • • to represent 
vectors, bold capital letters A, B, • • • for matrices and bold caligraphic letters C, Ad, X, - ■ ■ for tensors. The 
components of a tensor ^ are written in the form of a;ij^i2...ijy, which denotes the (11,^2, • • • , «Ar)-th component 
of Af. 

1.4. Outline. The rest of the paper is organized as follows. We overview some tensor related concepts 
and properties in Section 2 and the APG method for multi-convex problems in Section 3. Section 4 applies 
APG to sparse NTD problem. The per-iteration complexity and convergence results of the algorithm are 
given. Section 5 improves the APG method. Then the algorithms are modified for sparse NTD with missing 
values in Section 6. Some extensions of sparse NTD are considered in Section 7. Numerical results are shown 
in Section 8. Finally, Section 9 concludes the paper. 

2. Overview of tensor. Below we list some tensor related concepts, which will be used in the for- 
mulation of our problems and the derivation of the algorithm. For more details about tensor, see the nice 
review paper [19] and the references therein. 

• fiber: A fiber of a tensor X is a vector obtained by fixing all indices of X except one. For example, 
a column of a matrix is a mode-1 fiber (the 2nd index is fixed), and a row is a mode-2 fiber (the 1st 
index is fixed). We use to denote a mode-n fiber of an A^th-order tensor X. 

• vectorization: The vectorization of a tensor X gives a vector, which is obtained by stacking all 
mode-1 fibers of X and denoted by vec(Ar). 

• matricization: The mode-n matricization of a tensor AT is a matrix denoted by X(„) whose columns 
are mode-ri fibers of X in the lexicographical order. Specifically, if AT g '^Iix-'-xIn ^ j^g niode-n fiber 
Xij...i^_^:i^^^...ijY becomes the j-th column of X(„), where 

fc-i 

J = l + ^(*fe-l) n Irn- 
k^n m=l 

• tensor-matrix product: The mode-n product of a tensor AT G jj^i x • • • x withamatrixA e M'^^^" 
is written as AT x„ A, and the result is an /i x • • • x /„_i x J x In+i x • • • x /^r tensor defined 
component-wisely by 

In 

• inner product: The inner product of two tensors A.,B G rIix---xIn (j^gfine;^ ^j^g same way as 
that of two matrices, namely, 

{A,B) = ai^...,^h^...,^. 

The Frobenious norm of a tensor X is defined as ||A^||f — \/ {X , X). 

• Tucker decomposition: Given a tensor G jjA x - x/n ^ ^j^g tucker decomposition of is to 
find a core tensor C G M^ix ••xi?,v ^^^^ factor matrices A„ G M^"^^",n = 1, • • • ,iV such that 



M. saCxiAi-'-x^rAjv- 
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(2.1) 



A special case is CP decomposition, which sets R„ = R for all n and fixes C to a super-identity. Let 
0Li^\ be the (i„, j)-th entry of A„. Then for CP decomposition, (2.1) becomes 

R N 

rriti-iN ~ X! n "-il^P ' ' ' 

j = l Tl=l 

Some properties. According to the above definitions, it is not difficult to verify that if AT = C Xi 
Ai • • • Xjv Ajv, then 

1 

yec{X) = (g) A„vec(C), (2.2) 

where 

1 

(g) A„^ Aw(^---(»Ai, (2.3) 
and A (g) B denotes Kronecker product of A and B. In addition. 




X(„)=A„C(„) (^AJ . (2.4) 



The properties in (2.2) and (2.4) will be used for efficient computation in the implementation of our algorithm. 
We also need the following properties of Kronecker product (see [12], for example): 

A®B(g)C = (A«)B)«)C = A(g)(B(8)C), (2.5a) 

(A(g)B)^ = A^(8)B^, (2.5b) 

(A(g)B)(C® D) = (AC)0 (BD), (2.5c) 

||A®B|| = ||A|| -IIBII, (2.5d) 

where ||A|| denotes the operator norm of A and equals the largest singular value of A. 

3. Alternating proximal gradient method. Recently, [37] characterized a class of multi-convex 
problems and proposed an APG method for solving these problems. Specifically, it considers 

s 

minF(xi, • • • jX^) = /(xi, • • • ,Xs) + V'rj(xj), (3.1) 

X ^ ^ 

i=l 

where variable x is partitioned into s blocks Xi, • • • , x^, / is assumed to be a differentiable and multi-convex 
function, and r.;, i = 1, • • • , s, are extended-valued convex functions. Here, / is called multi-convex if for each 
j, it is a convex function of x^ while all the other blocks are fixed. At the k-th iteration of APG, xi, • • • , x^ 
are updated alternatively from i = 1 to s by 

xf = argmin(V,/(x^^<„x,^x^^-l),x, - xf) + ^||x, - x,^||2 + r,(x,), (3.2) 

where x^^j is short for (xj, • • • ,x*^_^) and x^^^^i^ for (x*^_^j^, • • • ,x^~^), V if denotes the partial gradient of / 
with respect to x^, is a Lipschitz constant of Vi/(x*^^j, x^, x*^^^^) with respect to x^, namely, 

||V,/(x^^<„y„x^^-i) - V J(x^^<„z„x^^-i)||2 < i,'||y. - z,||2, V y„z„ 

4 



II • II2 denotes the Euclidean norm of a vector, and icf = x^'""^ + (x^""'^ — x(^~^) denotes an extrapolated 
point with weight > 0. It was demonstrated that a careful choice of cjf can significantly accelerate the 
algorithm. 

APG is a variant of the block coordinate minimization (BCM) method (see [33] and the references 
therein), which updates xi, • • • jX^ cyclically by minimizing the objective with respect to one block of vari- 
ables at a time while all the others are fixed at their most recent values, namely, 

xj^ = argmin /(x)^^, x,, x^^l) + r,(x,), for i = 1, • • • , s. (3.3) 

Though BCM decreases the objective faster, each subproblem in (3.3) is usually much more difficult than 
that in (3.2). For some simple r^, the update (3.2) has closed form solutions. For instance, if ri{xi) = S+{xi), 
an indicator function on nonnegative orthant which equals zero if all components of x^ are nonnegative and 
-l-co otherwise, then the update in (3.2) can be explicitly written as 



xf = max (^0,xf - V,/(x^^<„ xf , x^ ) j 



Under some boundedness assumptions, [37] establishes subsequence convergence of the APG method. 
Further assuming the so-called Kurdyka-Lojasiewicz (KL) property (see [5,25] for example), it shows that 
the sequence {x''} generated by (3.2) globally converges to a stationary point of (3.1) . The results are 
summarized below. 

Lemma 3.1. Suppose F is continuous and lower bounded on its domain, and suppose f is continuously 
differentiahle and has Lipschitz continuous partial gradient Wif for all i — 1, ■ ■ ■ , s. Let {x*^} be the sequence 
generated by (3.2) with parameters L^,ujf satisfying: 

1. There exist positive constants Ld, such that Ld < < for all i and k; 

2. For some constant S^^ < 1, it holds that oj^ < min ^1, (5^^ ^|^fc ^ for all i and k. 

Then any limit point o/jx*^} is a critical point of (3.1). Furthermore, if 

3. F(x'=) < F(x'=-i) for all k; 

4- {x*^} has a finite limit point x; 
5. F satisfies the KL property: 

|F(x) - i^(x)|^ 

^.^^^^^ (jf ( )) bounded in a neighbourhood of x for some 9 G [0, 1), (3-4) 

where dF{x) is the limiting subdifferential of F at x (e.g., see [31]), dist(0, 3-F(x)) ^ min{||y||2,y G 

dF{x.)}, and we adopt the convention 0° = |^ = 0, 
then the entire sequence {x*^} converges to x. 

Remark 3.1. In the lemma, items 1 and 4 can be made if the sequence {x*^} is uniformly bounded, and 
items 2 and 3 can be made if the parameters are chosen appropriately. The KL property (3.4) is important 
for the global convergence result. It is shown in [37] that this property holds for a great many functions such 
as the objective of (4.18) for sparse NTD and (6.4) for sparse NTD with missing values. 

4. Sparse nonnegative Tucker decomposition. This section applies the APG method introduced 
in Section 3 to the sparse NTD problem 

N 

minF(C,A) =f(C,A) + A,||C||i+ EA„||A„||i, 

C,A (4.1) 

s.t. C e M^i''-'''^",A„ e M^' n = l,--- ,N, 
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where Ad G is a given tensor, K^^^" contains all /„ x i?„ matrices with nonnegative components, 

A denotes (Ai, • • • ,An), 

e{C,A) = ^\\C x,A,---xnAn-M\\% 

is a data fitting term to measure the approximation in (2.1), |lC|li = ... is used to promote 

the sparsity of C, and Ac, Ai, • • • , Xn are parameters balancing the data fitting and sparsity level. 

The model (4.1) has been used in [24,26] for sparse NTD. Another model that has also been considered 
such as in [38] for sparse NTD uses the KuUback-Leibler divergence 

D{M\\g)= J2 f"in -.« log ^^^^^ (4.2) 

11.-. IN ^ 9zi-tN / 

instead of £{C,A) in (4.1), where G — C Xi Ai • • • x^r A^r. Since the logorithmic function requires its 
argument to be strictly positive, (4.2) is not used so often as £{C,A). In algorithms of NTD using (4.2) 
as a data fitting term, a small positive number e needs to be added to AA and Q for strict positivity. For 
simplicity, we only consider (4.1). 

4.1. Bound constraints for well-definedness. Before deriving the algorithm, we make some com- 
ments on (4.1). When all A's are positive, the objective of (4.1) is coercive, and it must have a bounded 
minimizer. However, note that for any positive scalars Sc,si,--- , sat such that their product equals one, 
{scC, siAi, ■ ■ ■ ,sjvAjv) does not change the value of £. Hence, if some A's vanish, the corresponding vari- 
ables would be unbounded such that the variables with positive A's would approach to zero, and (4.1) may 
not admit a solution. To tackle this problem, we add some constraints to enforce the boundedness of the 
variables with vanishing A's. If A„ = 0, we add 

A„ < max(l,||M||oo) (4.3) 

to enforce each component of A„ not greater than max(l, ||A^||oo), where |lAl|loo denotes the maximum 
component of Ad. If A^ = 0, we add 

C<max(l,||M||oo) (4.4) 

to bound C. The constraints in (4.3) and (4.4) are reasonable according to the following proposition. 

Proposition 4.1. If AA = C Xi Ai ■ ■ ■ x^v An for some (C,Ai,--- ,AAr), then there exists some 
(C, Ai, • • • , A^r) satisfying (4.3) and (4.4) such that Ad — C x Ai • • • x jv Ajv and (C, Ai, • • • , Ajv) has the 
same sparsity as that of {C, Ai, ■■ ■ ,Apf). 

Proof. Without loss of generality, assume no A„ has zero columns and ||A4||oo = 1- Let B^r be the 
mode-A^ matricization of C Xi Ai • • • XAr_i A^^i. According to (2.4), we have M(7v) — -A-atBat by observing 

C xi Ai • • • xat-i Atv-i = C Xi Ai • • • xjy-i An^i x^ In, 

where Ijv is an identity matrix. Choose a diagonal matrix D^r such that each column of A^r = AjvDat 
has unit maximum value. This Djy is invertible since An has no zero columns. Let Bjy = D^^Bjy. Then 
M(jv) = An'Bn- Since every involved matrix is nonnegative and each column of A^r has unit maximum 
value, it is straightforward to verify Bjv < 1. Note that folding Bn gives Cjv Xi Ai • • • x n-i An-i, where 
Ctv is obtained by re-scaling C such that the mode-iV matricization of Cat equals D^^Cjat). Therefore, 
repeating the above process will give Aat-i, • • ■ , Ai and C = Ci such that AA = C x Ai • • • x at Aat and 
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the constraints in (4.3) and (4.4) are satisfied. In addition, note that A„D„ and D^^B„ have the same 
locations of non-zeros as those of A„ and B„ for all n. This completes the proof. □ 

Remark 4.1. IfC Xi Ai • • • Xjv A]v is not exactly equal but close to M., one can magnify the bounds 
in (4.3) and (4.4) by multiplying some r > 1. 

For ease of description, we assume all A's to be positive in the next derivation of our algorithms, so there 
are no constraints as in (4.3) and (4.4) present. However, in the numerical experiments, we will add these 
constraints if some A's vanish. 

4.2. APG for sparse NTD. We first derive our algorithm for (4.1) and then discuss efficient com- 
putation for updates of the core tensor C and a factor matrix A„. Suppose the current iterate is (C,A). 
Applying the update (3.2) to C, we have 

Cnew = argmin(Vc^(C, A),C - C) + ^\\C - C\\l + X,\\C\U, 

C>0 ^ 

= max(^0,C--^Vc^(C,A)-^^ , (4.5) 

where is a Lipschitz constant of Vc^(C, A) with respect to C, namely, 

\\VciiCi,A)~VciiC2,A)\\F < L,\\Ci-C2\\f, VCi,C2, 
and C is an extrapolated point. Similarly, if the current iterate is (C, A), a factor matrix A„ is updated by 

(A„)„ow = argniin(VA„^(C, Aj<«, A„, Aj>„), A„ - A„) + ^||A„ - A„||| + A„||A„||i, 

A„>0 ^ 



= max ( 0, A„ - ^VaJ{C, Aj<„, A„, Aj>„) - ^ ) , (4.6) 

Lin L/„ 



where L„ is a Lipschitz constant of Va„^(C, Aj<„, A„, Aj>„) with respect to A„, and A„ is an extrapolated 
point. Perform the updates (4.5) and (4.6) to C, Ai, • • • , A^v cyclically, and we reach the APG method for 
sparse NTD. Its pseudocode is shown in Algorithm 1. 

Remark 4.2. We do re-updates in Line ReDo to make the objective nonincreasing. The monotonicity 
of the objective is important since the algorithm may perform unstably without the re-updates. Also, it is 
required in theory; see condition 3 of Lemma 3.L The objective evaluation is almost free compared to the 
gradient computations in (4.10) and (4.12); see the complexity analysis in Section 4-3- Moreover, in each 
one of our experiments, the re-update occurs only a few times (often less than 10), so it needs only a little 
more computations. 

The most expensive thing in Algorithm 1 is the computation of Vc^(C, A) and Va„^(C, A) in (4.7) and 
(4.8), respectively. Note that we have omitted the superscript. Next, we discuss how to efficiently compute 
them. 

Computation of Vc^. According to (2.2), we have 

£(C,A) = -||( (g) A„)vec(C) - vec(A<)||J, 

where 0^^jv defined in (2.3). Using the properties (2.5a)-(2.5c) of Kroncckcr product, we have 

1 1 
vec(Vc^(C, A)) = ( (g) A;[A„)vec(C) - ( (g) AI)ycc{M). (4.9) 

n=N n=N 
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Algorithm 1: Alternating proximal gradient for sparse nonnegative Tucker decomposition 



Data: tensor A^, core dimension (Ri, • ■ ■ , Rn), parameters Ac, Ai, • ■ ■ , Ajv > 0, and (C^^, A^^) = (C'\ A' 
for fe = 1, 2, ■ ■ do 

Choose Lc to be a Lipschitz constant of Vc^(C, A'°^^) about C. 
Choose uj^>0 and set C*" = C^'^ + cu^iC''-^ - C''"^). 
Update C by 

k 



C"=max(0,c''--i:Vc^(C^A'=-l^ ^'^ 



for n = 1, 



,A/^ do 



Choose Ln to be a Lipschitz constant of V A^^iC' , Aj^„, A„, A^j'^) about A„. 
Choose oj^ > and set A* = A*"^ + a;^(Aj;"^ - A^"^). 
Update A„ by 

A,i = max ( 0, A„ — -r^ Va„^(C , Aj^„, A„, Aj^„) 



(4.7) 



(4.8) 



if F(C'-', A*^) > F(C'=~\ A'^-i) then 
^ Re-update C'' and A*" by (4.7) and (4.8) with C*" = C*""^ and A* = A*"^ for all n, respectively. 

if Some stopping conditions are satisfied then 
[_ Output (C'°,AJ,--- ,A^) and stop. 



It is extremely expensive to explicitly form the Kronecker products in (4.9). Fortunately, we can use (2.2) 
again to have 

1 

vec(C xi A^Ai • • • Xat A^A^) = ( (g) A;|;A„)vec(C) 

n=N 

and 

1 

vec(M xi a7 • • • xat A^) = ( (g) Al)vec{M). 

Hence, we have from (4.9) and the above two equalities that 

Vc^(C,A) = C xi A^Ai • •• x^ A^Aat - M xi A^ - •• xat A^. (4.10) 
Computation of Va„^- According to (2.4), we have 

^(C,A) = -||A„C(„)((g)A,)^-M(„)f^. (4.11) 

Hence, 

Va J(C, A) = A„(B„B;^) - M(„)B;^ (4.12) 

where 

1 

B„ = C(„)((g)A,)^. (4.13) 

i=N 

Similar to what has been done to (4.9), we do not explicitly form the Kronecker product in (4.13) but let 

AT = C xi Ai • • • x„_i A„_i x„+i A„+i • • • xat Am- (4.14) 
Then we have B„ = X(„) according to (2.4). 



4.3. Complexity analysis. The main cost of Algorithm 1 hes in computing Vc^{C ' , A*^^„,A^>^) 
and Va„^(C'''", A*^^„, A^, A*^^^), which are required in (4.7) and (4.8), respectively. For simplicity, we omit 
all superscripts in the following analysis. Through (4.10), the computation of Vc^(C, A) requires 

N N N N j N 

c I E R'l^ + E n + E (11 ^0 ( n ^0 1 (4-15) 

-1 J — 1 i—l i — 1 i—1 i—j 

flops, where C « 2, the first part comes from the computation of all A^A^'s, and the second and third 
parts are respectively from the computations of the first and second terms in (4.10). Neglecting^ the time 
for unfolding a tensor and using (4.12), we have the cost for Va„^(C, A) to be 



C 



( \ 

ri-l j N n-1 N j N N 

E(n^o(n^o+^"(n^o e ( n ^odi^o+^nn^'+^-^^+^-n^^ 

j — l i—l i—j i—l j—n-^-l 2— n+1 i—j i^n 



y part 1 part 2 



(4.16) 



where C is the same as that in (4.15), "part 1" is for the computation of B„ via (4.14), "part 2" and "part 
3" are respectively from the computations of the first and second terms in (4.12). 

Suppose Ri < li for alH = 1, • • • , A^, and J|,- Ri ^ /j. Then the quantity of (4.15) is dominated by 
the third part, and the quantity of (4.16) is dominated by the first and third parts. Only taking account 
of the dominating terms, we claim that the quantities of (4.15) and (4.16) are similar. To see this, assume 
Ri = R,Ii = /, for all i's. Then the third part of (4.15) is X^jLi R' , and the sum of the first and 
third parts of (4.16) is 

n-l j N n-1 N j N N 

E (11^0(11^0 + ^411^0 E ( n ^0(11^0 + ^^11^^ 

j — l i—l i—j i—l j—n-\-l 2—71+ 1 i—j i—l 

n-1 N 

j = l j=n+l 

N N-n+1 

= E Ri''''^'+ E Ri'^^'^^ + Ri'' 

j = N-n+2 j=2 
N 

3=1 

Hence, the costs for computing Vc^(C, A) and Va„^(C,A) arc similar. 

After obtaining the partial gradients Vc^(C, A) and Va„^(C,A), it remains to do some projections to 
nonnegative orthant to finish the updates in (4.7) and (4.8), and the cost is proportional to the size of C and 
A„, i.e., Cp YiiLi R ^iid CpInRn with Cp ~ 4. The data fitting term can be evaluated by 

£{C,A) - i ((A;[A„,B„B,j) - 2(A„,M(„)B^) + \\M\\%) , 

where B„ is defined in (4.13). Note that A^A„, B„BX ^iid have been obtained during the 

computation of Vc^(C, A) and Va„^(C, A), and can be pre-computed before running the algorithm. 



^In tensor-matrix multiplications, unfolding and folding a tensor both happens, and they take a small amount of time in 
the whole process of tensor-matrix multiplication. 
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Hence, we need C{R'^^ + additional flops to evaluate £{C, A), where C « 2. To get the objective value, 

we need C(ni3,i Ri + J2iLi hRi) more flops for the regularization terms. 

Some more computations occur in choosing Lipschitz constants Lc and L„'s. When i?„ ^ /„ for all 
n, the cost for computing Lipschitz constants, projection to nonnegative orthant and objective evaluation 
is negligible compared to that for computing partial gradients Vc^(C,A) and Va„^(C,A). Omitting the 
negligible cost and only accounting the main cost in (4.15) and (4.16), the per-iteration complexity of 
Algorithm 1 is roughly 

(N j N N J N \ 

j = l 1=1 i=j j=l 1=1 i=j J 

where C « 2. If = 0(1) and max„ i?„ < 0{\ogYlf^i li), then the per-iteration cost of Algorithm 1 is 
scalable about the data size YiiLi ^i- Here, we do not assume M. to be sparse. If it is, we can employ the 
sparsity of A4, C and A to further reduce the computational complexity. 

4.4. Convergence results. We re-write (4.1) in the unconstrained form 

N 

mm ^(C,A)-t-Ae||C||i+(5+(C) + ^(A„||A„||i+J+(A„)), (4.18) 

n=l 

where (5+(-) is the indicator function on nonnegative orthant. According to [37], the objective of (4.18) is a 
semi-algebraic function [4] and satisfies the KL property (3.4). Therefore, according to Lemma 3.1, we have 
Theorem 4.2. Let {W''' = (C'^,A''')} be the sequence generated by Algorithm 1. //Ac,Ai,--- ,Xn are 
all positive, and 

1. There exist positive constants L^i^L^ such that L^^L^^ € [L^,L^; 

2. There is positive constant < 1 such that iJ^ < min ^1, Suj\J^^f^^ and < min ^S^^J^^^^^ for 

all n and k; 
then converges to a stationary point W. 

Proof. It suffices to verify all conditions required by Lemma 3.1. Obviously, the objective of (4.18) is 
continuous and lower bounded on its domain. It satisfies the KL property (3.4) from the above discussions. 
If {F{yv^)} is nonincreasing, then the positivity of Ac, Ai, • • • , Aat implies the boundedness of {W'^} and 
the existence of a finite limit point VV, so it remains to verify the monotonicity of {F(W'^)}. 

If the re-update step in Line ReDo does not occur, then FiyV^) < F{yV^^^). If it occurs, then from 
Lemma A.l, we know that the re- update makes the objective nonincreasing. Therefore, all conditions in 
Lemma 3.1 are satisfied and, thus the convergence result holds. □ 

Remark 4.3. Since {W*^} is bounded, the existence of Lj^ and can be guaranteed if we take and 
Li as 

Li = max (L„i„, \\{A%-'Y A)^' (AJ-I)TaJ-i || ) , (4.19) 

and 

L^ = max(L,„i„,||Bf,(B:;)T||), (4.20) 

where Lmin > is a constant, and 

K = Ct„) (A^"' KlX ® K-1 ® • ■ ■ ® ^iV ■ (4-21) 
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Fig. 5.1. Results by APG with two different orders of updating the core tensor and factor matrices. The red dashed line 
corresponds to the order o/C, Ai,--- ,Apf and the black solid line to the order of C, Ai,C, A2, ■■ ■ ,C,Ajv 




If some A„ and/or X^. vanish, we can add constraints (4.3) and/or (4.4) to enforce the iterate to be bounded. 
With these constraints, we further do projections 

C'' = min(max(l,|lM|loo),C'=) (4.22) 

after (4.7) and 

Aj; = min (max(l, ||M||oo), A^) (4.23) 

after (4.8). 

5. Improved APG for sparse NTD. Algorithm 1 updates the core tensor and factor matrices in 
the order of C, Ai, • • • , A^- It usually gives a nice solution. However, we observe that it converges not fast. 
This section improves Algorithm 1 by updating the core tensor and factor matrices in a different order. 

Since the core tensor C interacts with all the factor matrices, updating it more frequently is expected to 
speed up the convergence of the algorithm. In addition, the update of C costs roughly the same as that of a 
factor matrix; see the complexity analysis in Section 4.3. On the other hand, it is also important to update 
each factor matrix sufhciently often. To maintain an overall fast convergence speed, we update the core 
tensor and factor matrices in the order of C, Ai, C, A2, • • • , C, A^r. Algorithm 2 summarizes the improved 
method. 

To see how the updating order affects the algorithm, we compare Algorithms 1 and 2 on a 30 x 30 x 30 x 30 
random tensor with a5x5x5x5 core tensor. Figure 5.1 plots their relative errors at each iteration. We see 
that Algorithm 2 uses about 50% less iterations than Algorithm 1 to achieve the same accuracy. Algorithm 
2 is also faster than Algorithm 1. The former uses about 9.6 sec while the latter runs about 14.2 sec. 

From the complexity analysis in Section 4.3, we have that the per-iteration cost of Algorithm 2 is roughly 

(N j N N j ^ \ 

j = l 1=1 i=j j=l i=l i=j j 

where C « 2. Since Algorithm 2 uses a different block updating order, its convergence cannot be directly 
obtained from Lemma 3.1. However, we can obtain the same result as that in Theorem 4.2. It indicates that 
the convergence results in Lemma 3.1 should also hold for APG that updates the blocks in an essentially 
cyclic way, namely, each block of variables is visited at least one time within a fixed number of iterations. 
The proof of Theorem 5.1 is given in the Appendix. 

Theorem 5.1. Let {ys^^ ^ (C''',A'=)} be the sequence generated by Algorithm 2. If Ac, Ai, • • • , Xn are 
all positive, and 

1. There exist positive constants Ld,Lu such that Ljf'",L^ e [Ld,Lu]; 
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Algorithm 2: Improved alternating proximal gradient for sparse nonnegative Tucker decomposition 



Data: tensor , core dimension , • ■ ■ , Rn), parameters Ac,Ai,-- - ,Ajv>0, and (C ^ , A ^ ) = (C'\ A*^ 
for /c = 1, 2, ■ ■ do 

Set e"'-' = e"'" = C° if = 1 and C*'-^ = C^-^'^'S C"'" = C"'^''^ otherwise, 
for n = 1, ■ ■ ,N do 



Choose L*'" to be a Lipschitz constant of Vc^(C, A 



fc » fe-i 

.7<ni ^j>r. 



about C. 



Choose cj,*^'" > and set C ' = C'='""' + a;^"(C'''"-' - C*'"-^). 
Update C by 



max I 0, C 



^ -Vc^(c''",A,^<„,A)>^) 



A, 

■ k.n I ' 



L 

Choose to be a Lipschitz constant of Va„^(C*'", A*'^^, A„, A*'",^) about A 
Choose cjI; > and set A* = Ajj"-^ + a;^(A^"-^ — A^~^). 
Update A„ by 



if F(C'='",A,^<„,A'?-^) >i^(C'^ 



(5.1) 



(5.2) 



A,^<„,A^*>i)then 

Re-update C'"'" and A^ by (5.1) and (5.2) with c'"'" = C*^'""^ and A* = A*-\ respectively. 
Set =C'''^. 

if Some stopping conditions are satisfied then 
[_ Output (C*^, A*, • • • , A^) and stop. 



^. There is a positive constant < 1 such that lo^'^ < 5u>\j £fc,,i and w„ ^ 5lj\J £fc /or a// n and 

/c, where we use the notation L^'^ = L^^^'^ ; 
then converges to a stationary point VV of (4.1). 

6. Sparse nonnegative Tucker decomposition v^rith missing values. For some applications, the 
target tensor may not be fully observed. This section discusses how to modify Algorithms 1 and 2 to 
handle this case. The problem is formulated as 

N 

min Fn{C, A) = \\\Vn{C x i Ai • • • x jv Aat - M)\\l + A,||C||i + E A„||A„||i, 

C,A (6.1) 

s.t. C e M^^''-'''^",A„ e M^'''-"-, n = i,--- ,7V, 

where Vl indexes the observed entries of Al, and Vni'A) keeps the entries of ^ in and zeros out all others. 
Introducing variable X and restricting Vn{X) — Vq,{M.)^ one can write (6.1) equivalently to 

N 

min i||C xi Ai--- x^Aat- Ar|||, + A,||C||i+ ,^ ^, 

s.t. CeM^^^-^^",A„eM^"^^", 71 = I,--- ,iV, T'oW =n2(M). 

To modify Algorithm 1 for (6.1) or equivalently (6.2), we set X'^ — 'Pq,{M.) in the beginning. At the fc-th 
iteration of Algorithm 1, we use M. ~ X^^^ ^ wherever M. is referred to. Specifically, we replace M. with 
X^~^ in the computation of Vc^(C ,A'^^^) and Va„^(C'^, A*^^„, A^, A^~^) and the objective evaluations 
F{C'',A^) and F(C''"\ A''^"!). After Line ReDo of Algorithm 1, perform (3.2) to X. The update can be 
explicitly written as 



X"" = Vn{M)+Vn^{C'' Xi A^ • • • x^ A^). (6.3) 
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Algorithm 2 can be modified in the same way to solve (6.1) or (6.2). We do not repeat it here. 

Compared to Algorithms 1 and 2, the modified versions need extra computation for the update (6.3), 
which costs about 

N 3 N 

2E(n^o(n^o- 

j=l 1=1 i=i 

Therefore, the per-iteration complexity of the modified algorithms is still scalable about the data size if 
N = 0(1) and max„ i?„ — 0(logni^i li)- In addition, since the function 



^\\Vn{C xi Ai • • • Aat - M)\\l + ^c\\C\\i + S+{C) + ;^(A„|lA„|li + <5+(A„)) (6.4) 

n=l 

is also semi-algebraic and satisfies the KL property (3.4), the convergence results in Theorems 4.2 and 5.1 
hold for the modified algorithms. 

7. Extensions. For some applications, the core tensor C may not be required nonnegative. Algorithm 
2 can be modified to handle this case^ by changing the update for C in (5.1) to 

(^"^"-^Vc^(^''^A)<„,Ay)), (7.1) 

where 5p(Af) is a soft-thresholding operator defined componcnt-wisely as 5^ (a;) — sign(a:) • max(0, \x\ — fi). 
For some other applications, one may expect or know from priori information that some components of C have 
relatively large magnitute. To handle this case, one can impose different weights on different components of 
C in its sparsity regularizer, namely, replacing the term Ac||C||i in (4.1) with ||Wc0C||i, where "0" denotes 
component-wise product. The update (5.1) is then changed to 

C'^"" = -ax (o,C^'" - ^Vc^(c'=^ A,^<„, A^) - ^W, 

if there is nonnegativity constraint on C, or 

•^k.n 1 

^ V' ~ 

if there is no nonnegativity constraint on C. Similarly, one can modify Algorithm 2 to handle the case that 
some A„ is not required nonnegative. 

The APG method can also be adapted to solve sparse higher-order principal component analysis (HOPCA), 
which imposes orthogonality constraint on each factor matrix. The problem is formulated as 

N 

min i||C xi Ai • • • xat Aat - Mill + Ac||C||i + ^ A„||A„||i, 

C,A [t.Z) 

s.t. AjA„ =I„, n = I,-- - ,N, 

where I„ is an identity matrix of appropriate size. When Ac = 0, the optimal C = M. x i Aj • • • x jv Ajv, and 
one can eliminate C as shown in [19]. The concurrency of sparsity and orthogonality constraints makes the 
problem much more difficult. The work [1] considers rank-1 factor matrix with only one column and relaxes 
the orthogonality constraint to AJA„ < 1. Then it applies block coordinate minimization method to solve 
the relaxed problem. When some A„ has more than one columns, we relax (7.2) to 

N N 



mm i||C xi Ai-- - x^ A^ - M|||, + Ac||C||i + E A«|| A„||i + f ^ «^ar^ 



\ 2 



ra=l 



(7.3) 



S.t. ||a„j||2 < 1, n = 1, • • • ,iV,Vj, 



^Algorithm 1 can be modified in the same way for tlie variants. 
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where a„j denotes the j-th column of A„, X^i^j (^n i^n.i) used to promote the orthogonaUty of A„, and 
is a penahy parameter. 

To save space, we describe a method to solve (7.3). Our method is similar to Algorithm 2 and cycles 
over the variables by C, Ai,C, A2,- • • ,C, A^r. The update of C is done by (7.1), and A„ is updated one 
column by one column. Specifically, assume the current iterate is (C*^'", A^^„, A^^^^). Let Bj;j be the one 
obtained from (4.21). Using (4.11), we update the columns of A„ from j = 1 to i?„ by 



atj- argmin J||a„,,b^^^' + (A^),c(B^)^'^ - M(„)||^ + A„||a„j|li (7.4) 



" A' ((-A-^l)j<:(AfjJcafj a„j- — a^j) H :r-^lla„..; — at 



where bj^'^ denotes the j-th row of B^, (BfJ-' is the submatrix by taking all rows of B^ except the j-th one, 
(A^)jc is short for (aj^i,--- , a^j.^, af--^^, • • • ,a^7^ij, af^^^- = a^-^ + ,,-(a^-^ - a^-^) is an extrapolated 

point, and L'^ j is a Lipschitz constant of the gradient of ^ {j2i<j {^n.j^n.iY +J2i>j {^n.j^t^^Y^ with 
respect to a„ Note that 

||a„,,b:;-^ + (A^;),c(B,^;)^-^-M(„)||^ 
= l|b:;'-'||^||a„,,||2 + 2(((A:;),c(B:;)^-^ -M(„))(bf;'-'y,a,,,^ + ||(A:;),e(B:;)^-^ - M 

Then one can easily write the update in (7.4) explicitly as 

((Aj;),e(BfJ^-^-M(„))(bf;^y ^ 



,,2 
(n)ll_F- 




b + iiL b + /iL 



{A^),4A^)^.kl^\ , (7.5) 



where b = ||bj^'^ ||2, L — L^j, and Vbi denotes the projection to unit Euclidean ball. 

Since (7.3) is multi-convex under the partition (C,ai,i,--- ,a.iii^,--- ,a.N,i,--- ,ajvj^„), the method 
described above has global convergence if the parameters j , ut'^ j , L^'"' , uj^'" satisfy conditions as those in 
Theorem 5.1. We do not repeat it here. 

8. Numerical experiments. In this section, we compare Algorithms 1 and 2, the multiplicative update 
method (Mult) in [26], column-wise coordinate descent (CCD) in [24] and hierarchical alternating least 
squares (HALS) in [29] for solving (sparse) NTD on both synthetic and real world data. Also, we test the 
modified versions of Algorithms 1 and 2 and Mult for solving (sparse) NTD with missing values. Below we 
call Algorithm 1 and its modified version as APG-I, Algorithm 2 and its modified version as APG-II. All the 
tests are performed on a laptop with an i7-620m CPU and 3GB RAM and running 32-bit Windows 7 and 
Matlab 2010b with Tensor Toolbox of version 2.5 [2]. 

8.1. Implementation details. This subsection specifies the implementation of Algorithms 1 and 2 
and their modified versions in details about initialization, parameter settings, and stopping criteria. 

Initialization. For all the compared algorithms, we use the same starting point. Throughout the tests, 
we first randomly generate Aj*, • • • , A^^ and then process them by the HOSVD algorithm in [8]. Specifically, 
for (4.1), we let 

B = Mx, (A;)T • • • x„_i (A^i)^ x„+i (AO+i)T xm (A^)^, (8.1) 

and update 



A° = max(0, U„) 
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alternatively for n — 1, - ■ ■ ,N, where U„ contains the left f?„ singular vectors of B(„). Then set 

C°-Mxi(A;)T...x^(AO,)T. (8.2) 



For (6.1), we use the same initialization except replacing to 7^n(Al) in (8.1) and (8.2). It is observed that 
all the algorithms perform better with this kind of starting point than a random one, in both convergence 
speed and chances of avoiding local minima. 

Parameter settings. The parameters that have not yet been set in Algorithms 1 and 2 are Lipschitz 
constants L^'",L^,L^j and extrapolation weights oj^^", w^', oj^. For and Ljj in Algorithm 1, we set them 
according to (4.19) and (4.20), where imin = 1 is used. Note that in (4.19), we do not need to form the 
expensive Kronecker product because according to (2.5d), we have 



N 



fc-l\T A 



(A^-^) ' A 



N 



(A^l)^A^l||=^lKA^'^A^l| 



For L^^", in Algorithm 2, we set them in the same way, namely. 



= max(l,||(A^-i)TA^-i. 



and 



i:;=max l,||B:;(Bf;)'| 



(Aj)^AJII), 



where 



In addition, we take 



B 



eg (A 



k-l 
N 



, 0.99991 



k.71-— 1 



At) 



(8.3) 



where 



follows 



In the same way, 



4-k.n — l 



,k.O 



uj: = min (I;'',0.9999i 



Li 



for k > 1 , n = 1 , 



= min I w'^, 0.9999 



,iV. 



fc-i 



(8.4a) 
(8.4b) 
(8.4c) 



(8.5) 



where w follows 



j-fe-i 



1, 



i' = ^(l + yi + 4(^^) 



for fc > 1. 



(8.6a) 
(8.6b) 
(8.6c) 
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Fig. 8.1. Performance of APG-I (left) and APG-II (right) with and without extrapolation. The black solid curves for 
"with extrapolation" use the weights in (8.3) and (8.5), and "no extrapolation" sets ui^''^ ,uj^,lj^ to zero for all k and n. 
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We perform "min" operation in (8.3) and (8.5) for convergence. 

The weights o)^'" in (8.4) and oj'' in (8.6) are the same as that used in FISTA [3] for convex problems. 
It has been shown that this kind of extrapolation weight can significantly accelerate the proximal gradient 
method for convex problems both theoretically and numerically. Figure 8.1 shows that the extrapolation 
technique using the weights given in (8.3) and (8.5) significantly speeds up both APG-I and APG-II for 
NTD, which is a highly non-convex problem. In the test, we randomly generate an 80 x 80 x 80 tensor, and 
the core size is set to {Ri, R2, R3) = (5,5,5). We also tested APG with the dynamically updated weight 
used in [23, 36] for non-convex matrix completion problem and observed that APG performs as well as that 
with the above extrapolation weights. Hence, in some sense, our algorithm is robust to the extrapolation 
technique. 

Stopping criteria. We stop Algorithms 1 and 2 if a maximum number of iterations or maximum time 
is reached or one of the following conditions is satisfied 

\\C' x,A1---xnA%-M\\f , r , 

< tot, tor some k, (8.7a) 

\\M\\f 



\Fk^Fk+i\ 
1 + F, 



< tol, for three consecutive /c's, (8.7b) 



where Fk = F(C'', , • • • , A%) and tol is a small positive value specified below. For the modified versions 
of Algorithms 1 and 2 discussed in Section 6, we use the same criteria except replacing (8.7) to 

< tol, lor some k, (8.8a) 



\VniM)\\F 

< tol, for three consecutive k's, (8.8b) 



where ^ ^^^^(C^A^,•• • 



8.2. Nonnegative Tucker decomposition. This subsection compares APG-I, APG-II, Mult and 
HALS for NTD, which sets the sparsity parameters Ac, Ai, • • • , Aat to zero in (4.1). Throughout our tests, 
we add constraints (4.3) and do update (4.23) in Algorithms 1 and 2 if some A„ = 0. Similarly, we add (4.4) 
and do update (4.22) if Ac = 0. We first test the four algorithms on two sets of synthetic data and then 
APG and HALS on two image datasets. 

Synthetic data. In the first synthetic dataset, a 50 x 50 x 50 nonnegative tensor M. is generated in the 
form of Ad = C Xi Ai X2 A2 X3 A3, where C is generated by Matlab's command rand(5,5,5) and all A^'s 
by command max(0,randn(50,5)). Then Ad is re-scaled to have unit maximum component. The tensor 
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in the second test is generated in the same way but has an unbalanced dimension 10 x 10 x 1000, and 
the core tensor is 3 x 3 x 30. We emphasize that uniformly random C makes the problem more difficult than 
Gaussian random one because the former is not zero-mean. The true dimension is used in our tests, namely, 
/„ = 50, Rn = 5,n = 1,2,3 is set in (4.1) for the first dataset and {h^hJa) = (10, 10,1000), (i?i, i?2, -R3) = 
(10,10,1000) for the second one. We run the four algorithms to maximum time imax = 20 (sec) and 

■ w lie Xi X2 X3 - , /^r■ArAr■Ar^• w • u 

compare their relative error , where (C ,A[, A2, A3) is a solution given by 

||Aa||F 

an algorithm. Figure 8.2 plots how the relative error changes with the running time for each algorithm. From 
the figure, we see that Mult converges very slow in both cases and HALS works well for Ad with balanced 
dimension but converges slow for the unbalanced one. AFG converges faster than both Mult and HALS, in 
particular for the unbalanced case. AFG-H is slightly better than AFG-I in both cases. 

Image data. Due to the poor performance of Mult, we only compare AFG with HALS in the next two 
tests. The first test uses the Swimmer dataset constructed in [9], which has 256 swimmer images and each 
one has resolution of 32 x 32. We form a 32 x 32 x 256 tensor M. using the dataset and then re-scale it 
to have unit maximum component. The core dimension is set to (24,20, 20)'^. We run AFG and HALS to 
imax = 30 (sec) and plot their relative errors on the left of Figure 8.3. The second test uses a brain MRI 
image of size 181 x 217 x 181, which has been tested in [24] for sparse nonnegative tensor decomposition. 
We re-scale it to have unit maximum pixel and set the core size to (30, 30, 30). AFG and HALS both run to 
^max = 600 (sec), and the relative errors are plotted on the right of Figure 8.3. We see that HALS decreases 
the objective faster than AFG in the beginning but AFG eventually converges faster. In particular for 
the test with Swimmer dataset, the overall convergence speed of AFG is much faster than that of HALS, 
and AFG reaches much lower relative errors while HALS seems to be trapped at some local minimizer. In 
addition, AFG-II is faster than AFG-I in both tests, and the improvement is obvious in the first one. 

8.3. Sparse nonnegative Tucker decomposition. This subsection compares AFG, Mult and CCD 
for sparse NTD, which has at least one positive sparsity parameter among Ac, Ai, • • • , Ajv in (4.1). HALS is 
not coded for sparse NTD. In its implementations, all factor matrices are re-scaled such that each column has 
unit length after each iteration. The re-scaling is necessary for efficient update of the core tensor and does not 
change the objective value of (4.1) if all sparsity paramenters are zero. However, it will change the objective 
if some of Ac, Ai, • • • , Xn are positive. Hence, this subsection docs not include HALS for comparison. 



•^The mode-n ranks of are 24, 14, and 13 for n = 1,2, 3, respectively. Larger size is used to improve the data fitting. 
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Fig. 8.3. Results by APG and HALS on Swimmer dataset (left) and a brain MRI image (right). 




Running Time (sec) Running Time (sec) 

Table 8.1 

Average results by APG and CCD on synthetic data by fixing the core tensor to a super-identity 





APG 


CCD 


R 


time 


obj. 


rel. err. 


density 


time 


obj. 


rel. err. 


density 


5 


2.95e-l 


0.2505 


1.12e-4 


54.73% 


7.58e-l 


0.7312 


2.86e-4 


65.52% 


10 


5.49e-l 


0.5222 


1.30e-4 


57.63% 


l.lOe+0 


1.2355 


4.46e-4 


67.90% 


15 


1.03e+0 


0.8776 


1.55e-4 


61.47% 


1.68e+0 


4.9889 


6.56e-3 


68.23% 



Identity core tensor. First, we compare APG with CCD on a synthetic dataset and the brain MRI 
image used above. Since CCD is extremely inefficient to update the core tensor, we fix the core tensor to a 
super-identity T, namely, we test the algorithms for sparse NCP. Note that APG-I and APG-II are now the 
same since the core tensor is fixed. There are other NCP solvers such as [15, 17], which nevertheless do not 
include sparsity regularizers or constraints in their implementations. 

For the synthetic test, each tensor is 100 x 100 x 100 and generated in the form of Ad = X x i Ai X2 

A2 X3 A3. Here, Ai, A2 are generated by Matlab's command abs (randnClOO ,R) ) and A3 by rand(100,R), 

and all A„'s are re-scaled to have unit maximum component. In addition, 50% components of each A„ are 

selected uniformly at random and set to zero. R varies among {5, 10, 15}. We set Ri — R2 — R3 ~ R and 

Ai = A2 = A3 = 0.001 in (4.1). The maximum number of iterations is set to 1000 for APG and 200 for CCD 

since per-iteration complexity of CCD is much higher, and the stopping tolerance is set to tol — 10^^ in 

(8.7) for APG and lO^^ fo^. qqj^ gj^ce tol = 10"* is too tight for CCD. AU other parameters of CCD are set 

to their default values. Table 8.1 reports the average running time (sec), objective values, relative errors = 

llXxi A^ X2 A^ X3 A5-M||f # nonzeros of a;; 

and density = — over 20 mdependent runs, where 



||M||f 300i? 
(A^^, A2, A3) is a solution given by one algorithm. For the brain MRI image, we set Ri = R2 = R3 = R 

with R varying among {20,30,40} and Ai = A2 = A3 = 0.5 in (4.1). The maximum number of iterations 

and stopping tolerance are set as the same as above. Table 8.2 reports the average results of 10 independent 

trials. From the results in Tables 8.1 and 8.2, we see that APG is faster and also achieves lower objective 

values and relative errors. The densities given by the two algorithms are similar. 

Non-identity core tensor. Secondly, we compare APG and Mult on the brain MRI image used above 
and the CBCL face image dataset^ which has been tested in [32] for nonnegative tensor decomposition. For 
the brain MRI image, we set Ri = R2 = R3 = 30 and Ac = Ai = A2 = A3 = 0.5 in (4.1). We run APG 



*http: / /www. ai.mit.edu/projects/cbcI 
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Table 8.2 

Average results by APG and CCD on a brain MRI image by fixing the core tensor to a super-identity 





APG 


CCD 


R 


time 


obj. 


rel. err. 


density 


time 


obj. 


rel. err. 


density 


20 


3.16e+l 


4.1960C+3 


1.42e-l 


50.06% 


1.70e+2 


4.6173e+3 


1.44e-l 


50.99% 


30 


6.10e+l 


3.0534C+3 


1.06e-l 


45.98% 


1.89e+2 


3.6463e+3 


1.14e-l 


46.76% 


40 


9.35e+l 


2.7788e+3 


9.28e-2 


40.52% 


2.07C+2 


3.2940e+3 


l.Olc-1 


42.99% 



Table 8.3 

Average results by APG and Mult on a brain MRI image with the core size R\ = R2 = R3 = 30 





APG-I 


APG-II 


Mult 


time 


obj. 


rcl. err. 


fac. den. 


core den. 


obj. 


rcl. err. 


fac. den. 


core den. 


obj. 




rcl. err. 


fac. den. 


core den. 


100 


1.5982e+3 


4.56C-2 


29.28% 


11.32% 


1.0495e+3 


3.91e-2 


25.50% 


11.52% 


6.1071eH 


-3 


1.740-1 


32.42% 


29.94% 


200 


9.82160+2 


2.78C-2 


24.20% 


15.57% 


6.8536e+2 


2.27e-2 


20.42% 


16.84% 


4.7282eH 


-3 


1.47e-l 


30.64% 


32.09% 


300 


8.I22O0+2 


2.34C-2 


20.23% 


16.82% 


6.2003e+2 


1.99e-2 


19.03% 


18.04% 


4.0218eH 


-3 


1.30e-l 


29.11% 


31.36% 



and Mult to imax = 300 (sec) and report the results at time t — 100,200,300 (sec). Table 8.3 summarizes 

-jj- nonzGros of 

the average results of 10 independent runs. The "core den." is calculated by and "fac. 

1 1 ■QQj-j^^cros of -A.'^ 

den." by — r^. We see that APG reaches much lower objective values and relative 

^ 30- (181 + 217+181) 

errors than those by Mult. In addition, the solutions obtained by APG are sparser than those by Mult and 
are potentially easier to interpret. APG-II is slightly better than APG-I in both objective values and data 
fitting. 

The GBCL dataset has 6977 face images, and each one is 19 x 19. We use all these images to form a 
19 X 19 X 6977 nonnegative tensor Al, which is then re-scaled to have unit maximum component. The core 
size is set to (i?i, i?2, -R3) = (5,5,50) and the sparsity parameters to Ac = 0.5, Ai = A2 = A3 = 0, namely, 
we only want the core tensor to be sparse. Table 8.4 reports the average results obtained by APG and Mult 
at running time t = 25,50,75,100 (sec). We see that APG reaches much lower objective values and also 
lower relative errors than those by Mult. The solutions given by APG are much sparser than those by Mult. 
This may be because APG uses the constraints (4.3) while Mult does not. APG-I produces almost the same 
relative errors as those by APG-II and gives slightly greater objectives. 

8.4. Sparse nonnegative Tucker decomposition with missing values. This subsection tests APG 
for (6.1) on synthetic data and compares it with Mult on the brain MRI image used above. 

Performance of APG vi^ith different samples. First, we show that APG using partial observations 
can achieve similar accuracies as that using full observations. We generate each Ad in the form of Ad ~ 
C X 1 Ai X A2 X A3 and then re-scale it to have unit maximum component, where C is generated by Matlab's 
command max(0,rcLndn(R,R,R)) and each factor matrix A„ by max(0,rajidn(50,R)) with R varying among 
{5, 8, 11, 14, 17, 20, 23, 26}. We choose SR = 10%, 30%, 50%, 100% samples uniformly at random and compare 
the performance of APG using different samples. The maximum number of iterations is set to 5,000 and 
the stopping tolerance to tol = 10~^. Figure 8.4 plots the average relative errors and running time (sec) of 
APG-I over 20 independent trials, and Figure 8.5 plots the results of APG-II. We see that APG using 30% 
and 50% samples gives similar accuracies as that using full observations. APG with 10% samples can still 
make relative errors low to about 1% as i? < 14, but 10% samples seem not enough when R > 17. With 
10% samples, APG-I performs better than APG-II, while the latter becomes better as the samples increase 
to 30% and 50%. Longer time by APG with partial observations is due to the extra update (6.3) and more 
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Table 8.4 

Average results by APG and Mult on CBCL dataset with the core size {R\, R2, R^) = (5,5,50) 





APG-I 


APG-II 


Mult 


time 


obj. 


rel. err. 


core den. 


obj. 


rel. err. 


core den. 


obj. 


rel. err. 


core den. 


25 


3.2707C+4 


2.73e-l 


16.01% 


3.1891e+4 


2.70e-l 


8.83% 


6.2191C+4 


3.70e-l 


92.67% 


50 


3.1389e+4 


2.68e-l 


10.14% 


3.1378e+4 


2.68e-l 


6.91% 


5.4659C+4 


3.45e-l 


73.99% 


75 


3.1326e+4 


2.67e-l 


8.89% 


3.1311C+4 


2.67e-l 


6.58% 


5.1177C+4 


3.33e-l 


62.78% 


100 


3.1317e+4 


2.67e-l 


8.34% 


3.1311C+4 


2.67C-1 


6.50% 


4.8796e+4 


3.25e-l 


55.86% 



Fig. 8.4. Average relative errors (left) and runnting time (right) by APG-I using different samples 





SR = 


10% 




SR = 


30% 




SR = 


50% 




SR = 


100% 







SR 


= 10% 




SR 


= 30% 




SR 


= 50% 




SR 


= 100% 



dimension of core tensor 



dimension of core tensor 



iterations. When R > 17, the running time of APG with 10% samples decreases because it stops earUer. 

Comparison with Mult''. Secondly, we compare APG with Mult on the brain MRl image used above. 
The core dimension is set to Ri ^ R2 ~ Rj, = 30 and sparsity parameters to Ac = Ai = A2 = A3 = 0.5. We 
compare the two algorithms using SR = 10%, 30%, 50% uniformly randomly chosen samples and run them 
to imax = 600 (sec). Table 8.5 shows the average results at time t = 150,300,450,600 for different samples 
over 5 independent trials. From the table, we see that Mult fails with 10% samples while APG can still work 
reasonably. In all cases, APG performs better than Mult in both accuracy and speed. The solutions given 
by APG are sparser than those by Mult for SR = 30%, 50%. Again, APG-I performs better than APG-II 
with 10% samples, but APG-II becomes better than APG-1 with 30% and 50% samples. 

8.5. Sparse higher-order principal component analysis. We use a simple test with synthetic data 
to show that (7.3) can be better than unregularized HOPCA that sets all of Ac, Ai, • • • , Aat to zero in (7.2). 
We use the APG method described in Section 7 for (7.3) and HOOI [8] for the unregularized HOPCA. We 
set ijjj- = ||(A{;)jc(Aj;)Jc|| in (7.5) and w^j- in the same way as in (8.5). 

We generate a 50 x 50 x 50 tensor in the form of = C Xi Ai X2 A2 x A3 +J\f. Here, C is 3 x 3 x 3, 
and each element is drawn from standard Gaussian distribution. Then 60% components of C are selected 
uniformly at random and set to zero. Factor matrices have sparsity patterns shown in Figure 8.6, and each 
non-zero element is drawn from standard Gaussian distribution. Then each column is normalized. A/" is 
Gaussian random noise and makes the signal-to- noise-ratio SNR = 60. The sparsity parameters are set to 
Ac = Ai = A2 = A3 = 0.02 and orthogonality parameter /i = 0.1 in (7.3). The sparsity patterns of the 
original C and A and those^ given by APG are plotted in Figure 8.6. We see that the solution given by 

^Although Mult converges very slowly, it is the only one we can find that is also coded for sparse nonnegative Tucker 
decomposition with missing values. 

^We permute the columns of the factor matrices and do permutations to the core tensor accordingly. 
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Fig. 8.5. Average relative errors (left) and runnting time (right) by APG-II using different samples 




dimension of core tensor dimension of core tensor 

Table 8.5 

Average results by APG and Mult on a brain MRI image from different samples 





APG-I 


APG-II 


Mult 


time 


obj. 


rol. err. 


fac. don. 


coro don. 


obj. 


rol. orr. 


ac. don. 


coro don. 


obj. 




rol. orr. 


fac. don. 


coro don. 


SR = 10% 


150 


1.22210+3 


1.99e-l 


25.81% 


1.79% 


1.14590+3 


2.10e-l 


24.61% 


0.70% 


1.5600O- 


h4 


l.OOo+O 


0.00% 


0.00% 


300 


8.37510+2 


1.44e-l 


17.94% 


1.40% 


8.63470+2 


1.76e-l 


14.66% 


0.89% 


1.5600O- 


h4 


l.OOo+O 


0.00% 


0.00% 


450 


7.3223e+2 


1.27e-l 


16.01% 


1.56% 


7.9015O+2 


1.67e-l 


13.72% 


1.03% 


1.5600O- 


h4 


l.OOo+O 


0.00% 


0.00% 


600 


6.8072e+2 


1.19e-l 


15.44% 


1.74% 


7.5803O+2 


1.63e-l 


13.21% 


1.14% 


1.5600O- 


h4 


l.OOo+O 


0.00% 


0.00% 


SR = 30% 


150 


1.77190+3 


1.22C-1 


31.48% 


3.77% 


1.27790+3 


l.lOe-1 


26.95% 


2.83% 


2.47690- 


h3 


1.90O-1 


31.33% 


44.30% 


300 


1.14880+3 


8.52C-2 


23.06% 


4.50% 


7.52860+2 


6.65e-2 


18.24% 


5.91% 


1.9551e- 


h3 


I.6O0-I 


28.27% 


40.61% 


450 


9.1865e+2 


6.980-2 


20.50% 


5.59% 


6.21820+2 


5.20e-2 


16.14% 


7.85% 


1.6478e- 


h3 


1.390-1 


25.80% 


35.69% 


600 


8.1261e+2 


6.28e-2 


18.35% 


6.29% 


5.79790+2 


4.72e-2 


15.19% 


8.63% 


1.4545e- 


h3 


1.24e-l 


23.93% 


32.70% 


SR = 50% 


150 


2.11780+3 


9.890-2 


32.71% 


5.65% 


1.43110+3 


8.67e-2 


28.00% 


4.66% 


3.55980- 


h3 


1.840-1 


31.75% 


43.23% 


300 


1.2702e+3 


6.34e-2 


24.69% 


7.36% 


7.8603O+2 


4.84e-2 


19.39% 


9.27% 


2. 73650- 


h3 


1.540-1 


29.11% 


44.59% 


450 


9.7654e+2 


4.970-2 


22.04% 


8.96% 


6.39670+2 


3.71e-2 


17.22% 


11.32% 


2.2749e- 


h3 


1.330-1 


26.81% 


45.33% 


600 


8.37890+2 


4.190-2 


20.28% 


10.07% 


6.0024O+2 


3.41e-2 


16.41% 


11.97% 


1. 98820- 


h3 


1.20C-1 


25.02% 


44.11% 



APG have almost the same sparsity pattern as the original ones. To see how close to orthogonality each 
factor matrix is given by APG, we first normalize each column of the factor matrices and then calculate 
||AXA.„ - I||f/||I||f, which are 2.95 x 10-3,1.36 x 10-^,7.24 x IQ-^, respectively for n = 1,2,3. Hence, 
they are almost orthogonal. Though the solution by HOOI makes a relatively higher data fitting, it is highly 
dense with no zero element. Therefore, the solution given by (7.3) is potentially better than that of (7.2) 
for some applications such as classification. 

9. Discussions. Sparse NTD aims at decomposing a tensor into the product of a core tensor and 
some factor matrices with nonnegativity and sparsity constraints. Existing algorithms for this problem 
either converge rapidly with very expensive per-iteration cost or have low per-iteration cost with very slow 
convergence speed. We have proposed two versions of APG methods, which own both low per-iteration 
complexity and fast convergence speed. Moreover, the algorithms have been modified for sparse NTD from 
partial observations of a target tensor. The modified algorithms also have low per-iteration cost and can 
give similar decompositions from half of or even fewer observations as those from full observations. 
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Fig. 8.6. Sparsity pattern of the orginal C and A and those given by APG method 



Original A APG's A 



Original C 



APG's C 



Further speed up. One way to further improve the convergence speed of Algorithm 2 is to choose a 
possibly smaller L^'" rather than a Lipschitz constant of Vc^(C, A*^^^, A^^^^) such that 

, . T k,n I 

£(C'='",A^^<„,A^^>i)<^(C '",A^^<„,A^^>i) + (Vc^(C,A^^<„,Ay),C'='"-C ^") + ^||c'='"-C ''"H^^, (9.1) 

where C'^'" is obtained from (5.1). One can achieve such an L^'" in a dynamic way. First, initialize it to 
some value, say L^'" = 0.5L^'"~^. Then do the update (5.1) and check whether (9.1) holds. If it holds, then 
C*^'" is accepted and continute. Otherwise, increase Ljf'" to r/Lj?'" for some rj > 1 and repeat the update 
(5.1) until (9.1) holds. As L^'" is a Lipschitz constant of Wc^iC, Aj^^, A^^^), the inequality in (9.1) must 
hold, so the above procedure will terminate in finite steps. Similarly, one can dynamically choose a possibly 
smaller rather than a Lipschitz constant of Va„^(C'''", A*^^„, A„, A^~;^). Smaller LJ:-" and can often 
speed up the convergence of the algorithm. However, more computation is required to check inequalities 
such as the one in (9.1). Hence, how to initialize L^^" and to balance the speed improvement and extra 
computation is a key point. We leave these to interested readers. 

Acknowledgements. This work is partly supported by ARL and ARO grant W911NF-09-1-0383 and 
AFOSR FA9550-10-C-0108. The author would like to thank Prof. Wotao Yin for his valuable discussions 
and also thank Anh Huy Phan for sharing the code of HALS. 

Appendix A. Proof of Theorem 5.1. We need the following lemma. 

Lemma A.l (Lemma 2.1 of [37]). Let ^i(u) and ^2(u) be two convex functions defined on a convex set 
hi. Suppose Ci(u) is differentiable and has Lipschitz continuous gradient. Let £,{u) — i^i(u) + C2(u) and 

u* = argmin(VCi(v),u- v) + ^||u - v|p + ^2(u), 

where L is a Lipschitz constant o/V^i(u), then 

e(u)-^(u*)> ||lu*-v|l2 + L(v-u,u*-v), for anyueU. 
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A.l. Subsequence convergence. First, we give a subsequence convergence result, namely, any limit 
point of {W''} is a stationary point. Using Lemma A.l, we have 

>^\\C'"''' _ C'-'"|||, + L^" (C^'" - (,k,n-l^^k,n _ 
T k.n T k,n 

^^^\\C'''"^^ - C'^^'^Wl - =^(wf'")2||c'''""^ -C'''""^|||, (A.l) 

Tk,n Tk,n—1 
>_^|H,fc,«-l _^fc,„|j| _ ^_j2|j^fc,n~2 _^fc,«-l|||^ (^ 2) 



where we have used w^'" < y ^l*'-" S*^^ ^^"^ ^^^^ inequality. Note that if the re-update in Line ReDo-II 
is performed, then oj^'^" = in (A.l), and (A. 2) still holds. Similarly, we have 

F(C'='", A}<„, A^) - F(C'='", Aj<„, A^^-i) > ^HA^^ - A^;!!^ - ^^IWA^-^ - A^-^W^. (A.3) 

Summing (A.2) and (A.3) together over n and noting C'''^^ = c^^-i'^-i, C'''" = q^-^'^ yield 



> 



^ / Th,n rk,n — l rk r k — 1 

El \\^k,n-l ^k,n\\2 s2\\^k,n-2 ^k,n — l\\2 , ^n\\\k — l \ k \\2 <:2\\\k-2 Afc-l||2 

-C ■ \\p — 5^\\C ' -C ' \\p + —-\\A„ - A„\\p — <5^||A„ - A„ y 



rk.N rfe-l,]V -'^"1 1-1 s:2\rk,n 



2 

n=l 

-'V / r fe r fc-1 

+ ^(^||Ari-Afjl|-^5S|lAr-Ar^||| 

n = l ^ 

Summing (A.4) over fe, we have 

K N 



k=l n=l 

/S' N 



k=l n=l 

Letting if — > cx) and observing F is lower bounded, we have 

oo N 

E E (ll^"'""' - ^'-"11^ + - <\\f) < ^- (A.6) 

fc=l n=l 

Suppose VV = (C, Ai, • • • , Aa?) is a hmit point of {W'''}. Then there is a subsequence {W*^ } converging 
to VV. Since {LJ:'",LJ;} is bounded, passing another subsequence if necessary, we assume L^' Z" and 
L„. Note that (A.6) imphes A'''-'^ A and C™'" C for all n and m = k' , k' - 1, A:' - 2, as fc ^ oo. 

k .n — 

Hence, C — > C for all n, as A: — oo. Recall that 

C''' " = argmin^Vc^(c'' ", A,^;„, A^^>-J),C - c'' + ^\\C - c'' "||| + A,||C||i. (A.7) 

Letting k ^ oo and using the continuity of the objective in (A.7) give 

_ _ 
C = argmin (Vc£(C, A),C-C) + ^\\C- C\\l + A,||C||i. 
c>o ^ 
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Hence, C satisfies the first-order optimality condition 

(Vc^(C,A) + Ac7'c,C-C) >0, forallC>0, some Pc G 5||C||i (A.8) 

Similarly, we have 

(VaJ(C, A) + A„P„, A„ - A„) > 0, for ah A„ > 0, some P„ e a||A„||i, for aU n. (A.9) 

Note (A.8) together with (A.9) gives the first-order optimality conditions of (4.1). Hence, VV is a stationary 
point. 

A. 2. Global convergence. Next we show the entire sequence {W'"} converges to a limit point VV. 
Since all Ac, Ai, • • • , Ajv are positive, the sequence {W'^} is bounded and admits a finite limit point VV. Let 
E = {yV : \\W\\f < 4:iy}, where ||W||_f = V||C|||, + ||A|||, and is a constant such that \\{C'''" , A'')\\f < v 
for all fc,n. Let Lq be a uniform Lipschitz constant of Vc^(W) and VA„^(VV),n = I,-- - ,Af, over 
namely, 

||Vc^(y) - Vc^(^)|lf <LG\\y - Z\\f, yy,Z e E, (A.lOa) 
\NAAy)-'^Aj{2)\\F<LG\\y~2\\F, yy,ZeE, Vn, (A.lOb) 

Let 

N 

H{C,A) = e{C,A) + AcllClli + S+{C) + ;^(A„|lA„|li + <5+(A„)) 

n=l 

be the objective of (4.18) and 

re(C) = Ae||C||i +(5+(C), r„(A„) = A„||A„||i +(5+(A„), n = l,---,N. 

Recall that H satisfies the KL property (3.4) at W, namely, there exist 7,p > 0, 6* G [0, 1), and a neighbor- 
hood B(yV,p) = {W : II W - VVIIf < p} such that 

|i7(>V) -i?(VV)|'' < 7-dist(0,ai/(W)), for all W G B(VV, p). (A.ll) 

Denote Hk = H{W'') - H{W). Then Hk i 0. Since VV is a limit point of {W''} and || A'= - A'=+i||i. ^ 
Q |j^fc,n-i _ c'''"\\f for all k,n from (A. 6), for any T > 0, there must exist fco such that G 
B{W,p),j ^ ko,ka + l,fco + 2 and 

TiHl;" + WA'"' - A'">+^\\f + IIA'^o+i - A'^-o+^ll^ + ||c'=°+2.^-i - c'^-'+^'^lli.) + ||W''°+^ - VV||f < P- 

Take T as specified in (A. 23) and consider the sequence {'W'^}k>koi which is equivalent to starting the 
algorithm from VV*^" and, thus without loss of generality, let fco = 0, namely, W-' e S(VV, p),j — 0, 1, 2, and 

T{Hy + ||A" - A^IIf + ||Ai -A^F + IIC^'^-i - C^'^\\f) + ||W' - VV||f < P- (A.12) 

The idea of our proof is to show 

W'' e B(VV,/9), for aU k, (A.13) 

and employ the KL inequality (A.ll) to show {W*^} is a Cauchy sequence, thus the entire sequence converges. 
Assume W'' G B(VV, p) for < fc < K. We go to show W^+^ € -B(VV, p) and conclude (A.13) by induction. 
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Note that 

dHiW") = {dn{Al) + Vai^(W'-)} X • • • X {drN{A%) + Va„^(W")} x {5r,(C^-'") + Vc W')} 
and 

-L:;(A:; - Afj - VAjiC'-'\ A^^<„, Al Ay ) + Va„ W^) ear„(A:;) + Va Vn, fc, 

Hence, for all k < K, 

AT 

+ Ell^A,/(C*•^A)<„,A^;,Ay)-VA„f(W*)||^ + ||Vc^(c'•'^,A,^^,A^,-')-Vc^(VV'■■)i|^ 



<L„(||A^- - A'^-^llf + ||A*-^ - A*=-2||f) +L4||C'''^ -C^'^-^F + Hc^'^^-i - C^'^-'l 



+ ^ LcdlC"'" - C^'^^Wf + IIA" - A'=-'||f + IIA*^-' - A^-'^Wf) 

n = l 

+ LcdlC"''^ -C'-^-^F + ||C"'~-^ -C"'^-'||f + ||A* - A'=-^||f) 
<{Lu + iN + 1)Lg) |^||A'= - A'^-'IIf + ||A'=-' - A'^-'^Wf + \\C''''^ - C^'^^-^f + ||C'='"-' - C'^'^Hf^ , (A.14) 

where we have used L^, i^'" < L„, Vfc, n and (A. 10) to have the second inequahty, and the third inequaUty 
is obtained from IjC'^'" - C'''^\\f < J^fjn 11^^'''' " C^'^+^Wf and doing some simphfication. Using the KL 
inequahty (A. 11) at W = W*' and the inequahty 



we get 

^dist(0,ai/(W'=))(Hr^ - Hl-f) >H,- (A.15) 

By (A. 4), we have 
Hk — Hk+i 

Tk + l.N Tk,N 



l-"- "w7^c ii^fc+l/n-l ^fc+l.ji||2 , 'S~^^7i lltk Afc+l||2 ^ 



, fc ||2 



n=l n=l ^ 

Combining (A.14), (A.15), (A.16) and noting > Ld yield 



+ ^^'^"^-^^ I|C''+^'"-' - C^+^'-lll (A.17) 
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2 



By Cauchy-Schwart inequality, we estimate 
\/right side of inequality (A. 17) 



1 + 5^ 



JV-l 

+ vYl ||C'=+''""' ~ C'=+''"||f, (A.18) 

n=l 

where 77 > is sufficiently small and depends on di^,Ld,N, and 
\/left side of inequality (A. 17) 

< _ 0) (^fe -i^fe+i) (A-19) 



where /i > is a sufficiently large constant such that 

_<nnn(,y,^^-). 

Combining (A. 17), (A. 19), (A.18) and summing them over k from 2 to K give 

IJ,-y{Lu + {N + 1)Lg) ,rri-0 rri-e s 
4(1^6^ "^^'+1) 

X / JV-l 
fc=2 V n = l 



+ E ||(V^Ar \ . . . , ^^A)i\ /lFC^-^-I) - (/Lf Aj, . . . , \/l^A^, /lS^C^- 



•"11. 

fc=2 



fe = 2 
K AT-l 

fc = 2 n = l 

Simplifying the above inequality, we have 



4(1-61) 11 



2,JV^2,JV\j| 



+ E I|(\AFaL . . . , V^A^, V^L™C^+^'^-^) - (V^Ar\ ■ • • , V^A^+\ 

k = 2 
K JV-l 

+ (^--)EE iic*+''"-' -c'=+''"iif. (A.20) 
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Note that 



N 



fc+l||Afe_ Afe+l||2 _|_ ^fc+l,W||/^A;+l,7V-l _ ^fe+l,Ar||2 



n=l 

>Ld(||A'=- A^-+i||| + ||C 
>^(||A''-A^-+i||^ + ||C 



k+l,N-l yyk+l.Nn \2 



Plugging (A. 21) to inequality (A. 20) gives 

+ 5.11 ( v^aL • ■ ■ , v^Ajv, v^c^'^-^) - (y^A?, . . . , y^A^^, y^C^''^)!!^ 



(A21) 



> - '^"^ W^dlA^'-' - A^+^||f + ||C^+1>^-1 

+ i^A/^^'dlA^- - A'-'+l^ + \\e'+''''-' 

fc=2 
X JV-1 



which implies by noting Hq > Hu > 0, C''+^'° = and < L„, Vfc,n that 



^i-/{L^ + iN + l)LG) 1 

J^n H 



> 



4(1 -0) fi 

+ <5. v^dl Ai - A^ll^ + IIC^'^-i - C^^^ll^) 



2||Ai - A^IIj. + ||A" - AHIj. + IIC^'^ - C^'^-ii 



^(l|A^ 



A^'+i||f. + ||C^'+i'^-i-C^+i'^| 



^ fc=2 



(^_l)^||C'=,^_C'=+i 



,Af-l| 



fc=2 



>t(||A^ - A^^+i||f + WC"'^ - C^'+I'^IIf) + t ^ (||A^- - A'^+'Wf + WC"-^ - C'+'-^Wf), (A.22) 



where 



Let 



k=2 



. 1-5^ La 2 1 

r = mm \ / , ri 

I 2 V 2 M M 



T = max 



/Ai7(iu + (iV + l)LG) 1 , 5^ 



\ 4r(l 



2/ir r 



Then (A.22) implies 



T{Hl-<' + ||A" - A^llf + ||Ai -A^F + IIC^-^-i - C^^^IIf) 



k=2 



(A.23) 



(A.24) 
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from which we have 



K-l 
fe=2 

<T{H^-'> + IIA" - A^IIf + ||Ai - A^IIj, + HC^^^-^ - C^^^H^) + - VV||f < P- 

Hence, W^+^ G B(VV, p). By induction, we have W'' G B(yV, p) for all k, so (A.24) holds for all K. Letting 
K ^ oo gives Y^^=2 IIVV'' — W'^^^Hf < oo, namely, {W''} is a Cauchy sequence and, thus W'' converges. 
Since VV is a Hmit point of {W*^}, then W'"' — > VV. This completes the proof. 
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